Abstract: Social information networks, such as YouTube, contain
traces of both explicit online interactions (such as liking,
leaving a comment, or subscribing to a video feed) and latent
interactions (such as quoting or remixing parts of a video).
We propose visual memes, or frequently re-posted short video 
segments, for tracking such latent video interactions at scale. 
Visual memes are extracted by scalable detection algorithms 
that we develop, with high accuracy. We further augment visual 
memes with text, via a statistical model of latent topics. We model 
content interactions on YouTube with visual memes, defining 
several measures of influence and building predictive models for 
meme popularity. Experiments are carried out on over 2
million video shots from tens of thousands of videos. We observe
that a high percentage of videos contain remixed content, that
visual memes can be explained well with the joint model of memes
and words, that the influence of traditional news media versus
citizen journalists varies from event to event, and that viral content
can be predicted with both influence graph features and content 
features such as keywords. 

I. Introduction 

The ease of publishing and sharing videos has outpaced 
the progress of modern search engines, collaborative tagging 
sites, and content aggregation services — leaving users to see 
only a fraction of their subject [?]. This information overload
problem is particularly prominent for linear media (such as 
audio, video, animations), where at-a-glance impressions are 
hard to develop and are often unreliable. While text-based
information networks such as Twitter rely on retweets [24],
hashtags [30], or mentions to identify influential and trending
topics, similar functions for large video-sharing repositories
are lacking. On the other hand, video clipping or remixing
is an essential part of the participatory culture on sites like
YouTube [32]. Nevertheless, a reliable video-based "quote"
tracking and popularity analysis system would find utility 
in many different domains, such as brand monitoring, event 
spotting for emergency management, trend prediction, journal- 
istic content selection, better retrieval, or improved social data 
sampling systems. 

We propose visual memes, a representation for tracking 
video remix, and a tool for making sense of video "buzz". 
A meme is defined [?] as a cultural unit (e.g., an idea, value,
or pattern of behavior) that is passed from one person to
another in social settings. We define a visual meme as a short
segment of video that is frequently remixed and reposted by
more than one author. Video-making requires significant effort
and time, so reposting a video meme is a deeper stamp
of approval or awareness than simply leaving a comment,
giving a rating, or sending a tweet. Example video memes
are shown in Figure 1, represented in a static keyframe
format. We can see that each meme instance is semantically
consistent, and they often become iconic representations of 
the events from which they are produced. We develop a large- 
scale event monitoring system for video content, using generic 
text queries as a pre-filter for content collection on a given 
topic. We deploy this system for YouTube, and collect large 
video datasets over a range of topics. We then perform fast and 
accurate visual meme detection on tens of thousands of videos 
and millions of video shots. We augment visual memes with
text using statistical topic models, and then propose a Cross-Modal
Matching (CM²) method that explains a visual meme
with keywords. We design a graph representation for social
interactions via visual memes, and then derive graph metrics to
capture content influence and user roles. Furthermore, we use
features derived from the video content and meme interactions
to predict popular memes, with an average error within 2% on
the volume and 17% on the lifespan. An earlier version of this
work appeared in [36].

The main contributions of this work are as follows: 

• We propose visual memes as a novel tool to track large- 
scale video remixing in social media. We implement 
a scalable system that extracts all memes from over a 
million video shots, in a few hours on a single desktop 
computer. 

• We design and implement the first large-scale event- 
based social video monitoring and visual content analysis 
system. 

• We design a novel method, CM², to explain visual memes
with statistical topic models.

• We design a graph model for social interaction via memes 
for characterizing information flow and user roles. 

• We conduct empirical analysis on several large-scale 
event datasets, producing observations about the extent of 
video remix, the popularity of memes against traditional 
metrics such as view count, and different user group roles. 

II. Related Work 

This work is related to several active research areas in social 
media analysis, video analysis and annotation. 

Social network measurement studies have tracked explicit
social interactions on YouTube. The first YouTube measurement
study [12] characterized content category distribution
and exact duplicates of popular videos. Subsequent studies
on YouTube include tracking video response actions [?] using
metadata, modeling user comments to determine interesting
conversations [16], and using early video views to predict
ultimate popularity, characterized by view counts [34]. Quoting,
duplication, and reposting are popular phenomena in online 
information networks. One well-known example is retweeting 
on micro-blogs [24], where users often quote the original text
message verbatim, having little freedom for remixing and context
changes within the 140-character limit. Another example
is MemeTracker [25], which tracks the lifecycles of popular
phrases among blogs and news websites. Studies have shown
that the frequency of video remix can be used as an implicit
video quality indicator [31]. Several recent works looked at
YouTube phenomena: Biel and Gatica-Perez [6] focused on
individual social behavior such as non-verbal cues, while Hong
et al. [20] presented content summarization by monitoring
a query over time. Crane and Sornette 031 characterize the 
driving mechanisms of YouTube views as driven by external 
events or internal (virality). Borghol et al. [8] studied whole-clip
video clones to quantify the timing and user factors that
lead to the number of video views. However, none of the
prior work has defined the unit for a retweet or meme on a
video sharing network. In terms of understanding message
diffusion and influence on social information networks, prior
studies have examined user and network factors [37], the nature
of the topic [30], and the sentiment associated with the
message [?]. To the best of our knowledge, researchers have yet
to demonstrate effective prediction at the individual message
level (within a topic) using fine-grained content features.

Tracking near-duplicates in images and video has been a 
long-standing problem of interest. Recent foci include user-dependent
definitions of duplicates [14], speeding up detection
on image sequences, frames, or local image points [35], and
scaling out to web-scale computations using large compute
clusters [26]. We note, however, that most prior work in this
area is concerned with optimizing the retrieval accuracy of
detecting near-duplicate frames or sequences, rather than tracking
large-scale duplication behavior. Kennedy and Chang [23]
tracked editing and provenance of images on the web, with a
focus on distinguishing different types of image edits and their
ideological perspective. Our work, in comparison, tracks large-scale
remixes using video shots and focuses on inferring social
roles in video propagation. Annotating pictures with words
using topic models is a popular technique [4]; our unique
insight on the CM² models is that nearest-neighbor pooling
on the topic space works better than direct inference on topic 
models, likely due to the noisy nature of social media text. 

III. Visual remix and event monitoring 

A. Visual memes and online participatory culture 

We define visual memes as frequently reposted video seg- 
ments or images. Media researchers have observed that users 
tend to create "curated selections based on what they liked or 
thought was important" [32], and that remixing (or re-posting
video segments) is an important part of the "participatory culture"
[10] of YouTube. News event collections are particularly
suited for studying large-scale user curation, since remixing is
more prevalent here than in video genres designed for self-expression,
such as video blogs. The unit of interaction appears
to be video segments, consisting of one or a few contiguous
shots. The remixed shots typically contain minor modifications
that include video formatting changes (such as aspect ratio,
color, contrast, gamma) and production edits (such as
superimposing text, or adding borders and transition effects).
Most such transformations are well known as the targets
of visual copy detection benchmarks [29]. In this paper, meme
refers both to individual instances, visualized as representative
icons (as in Figure 1, left), and to the entire equivalence
class of re-posted near-duplicate video segments, visualized
as clusters of keyframes (as in Figure 1, right).

Intuitively, re-posting is a stronger endorsement requiring 
much more effort than simply viewing, commenting on, or 
linking to the video content. A re-posted visual meme is 
an explicit statement of mutual awareness, or a relevance 
statement on a subject of mutual interest. Hence, memes can 
be used to study virality, lifetimes and timeliness, influential 
originators, and (in)equality of reference. 

B. Monitoring Events on YouTube 

We use text queries to pre-filter content, and make the scale 
of monitoring feasible. We use a number of generic, time- 
insensitive text queries as content pre-filters. The queries are 
manually designed to capture the topic theme, as well as the 
generally understood cause, phenomena, and consequences of 
the topic. For example, our queries on the "global warming" 
topic consist of global warming, climate change, green house 
gas, CO2 emission, whereas the "swine flu" topic expands into 
swine flu, H1N1, H1N1 travel advisory, swine flu vaccination, 
and so on. We aim to create queries covering the main 
invariant aspects of a topic, but automatic time-varying query
expansion is open for future work. We use the YouTube API 
to extract video entries for each query, sorted by relevance and 
recency, respectively. The API will return up to 1000 entries 
per query, so varying the ranking criteria helps to increase 
content coverage and diversity. Then, for each unique video, 
we segment it into shots using thresholded color histogram 
differences. For each shot we randomly select and extract 
a frame as keyframe, and extract visual features from each 
keyframe. We process the metadata associated with each video, 
and extract information such as author, publish date, view 
counts, and free-text title and descriptions. We clean the free- 
text metadata using stop word removal and morphological 
normalization. Although noisy, the volume of retrieved videos
and memes is a telling indicator of event evolution in the real
world; a few example trends can be found in our recent paper [36]
and on the project webpage (see footnote 2).
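The shot segmentation and keyframe selection steps described in this section can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: it assumes every frame has already been summarized as a normalized color histogram, and the L1 distance, the threshold of 0.5, and the function names `segment_shots` and `pick_keyframes` are hypothetical choices made for the example.

```python
import random


def histogram_l1_distance(h1, h2):
    """L1 distance between two equal-length color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def segment_shots(histograms, threshold=0.5):
    """Split a frame sequence into shots.

    A new shot starts wherever the histogram difference between two
    consecutive frames exceeds `threshold` (an assumed value).
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    if not histograms:
        return []
    boundaries = [0]
    for i in range(1, len(histograms)):
        if histogram_l1_distance(histograms[i - 1], histograms[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(histograms))
    return list(zip(boundaries[:-1], boundaries[1:]))


def pick_keyframes(shots, rng=None):
    """Randomly select one frame index per shot as its keyframe."""
    rng = rng or random.Random(0)
    return [rng.randrange(start, end) for start, end in shots]


# Toy example: three "red" frames followed by two "blue" frames.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
shots = segment_shots(frames)
keyframes = pick_keyframes(shots)
```

In practice the threshold would need tuning per collection, since too low a value splits shots on camera motion and too high a value merges distinct shots.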



IV. Scalable Visual Meme Detection 

Detecting visual memes in a large video collection is a non-trivial
problem. There are two main challenges. First, remixing
online video segments changes their visual appearance, adding
noise as the video is edited and re-compressed (Section III-A).
Second, finding all pairs of near-duplicates by matching all $N$
shots against each other has a complexity of $O(N^2)$, which is
infeasible for any collection containing more than one million
video shots.

2 http://cecs.anu.edu.au/~xlx/proj/visualmemes.html

Fig. 1. Visual meme shots and meme clusters. (Left) Two YouTube videos that share multiple different memes. Note that it is impossible to tell from
metadata or the YouTube video page that they shared content, and that the appearance of the remixed shots (bottom row) has large variations. (Right) A
sample of other meme keyframes corresponding to one of the meme shots, and the number of videos containing this meme over time: 193 videos in total
between June 13 and August 11, 2009.

Our solution to the first challenge is robust keyframe matching,
where a keyframe is representative of a video shot, segmented
using temporal feature differences. We pre-process the
frame by removing trivial (e.g., blank) matches; detecting and
removing internal borders; normalizing the aspect ratio; performing
de-noising; and applying contrast-limited histogram
equalization to correct for contrast and gamma differences. We
then extract the color correlogram [21] feature for each frame
to capture the local spatial correlation of pairs of colors. The
color correlogram is designed to tolerate moderate changes in
appearance and shape that are largely color-preserving, e.g.,
viewpoint changes, camera zoom, noise, compression, and
to a smaller degree, shifts, crops, and aspect ratio changes.
We also use a "cross" layout that extracts the descriptor only
from horizontal and vertical central image stripes, thereby
emphasizing the center portion of the image and improving
robustness with respect to text and logo overlay, borders,
crops, and shifts. We extract an auto-correlogram in a 166-dimensional
perceptually quantized HSV color space, resulting
in a 332-dimensional feature. The keyframe matching uses a
query-adaptive threshold to normalize across query frames
and across the different feature dimensions. This threshold is
tuned on a training set.

Our solution to the complexity challenge is to use an indexing
scheme for fast approximate nearest neighbor (ANN) look-up.
We use the FLANN library [28] to automatically select
the best indexing structure and its appropriate parameters for a
given dataset. Our frame features have over 300 dimensions, 
and we empirically found that setting the number of nearest- 
neighbor candidate nodes m to can approximate k- 
NN results with approximately 0.95 precision. In running 
in $O(N\sqrt{N})$ time, it achieves two to three decimal orders
of magnitude speed-up over exact nearest neighbor search.
Furthermore, since each FLANN query results in an incomplete
set of near-duplicate pairs, we perform transitive closure on
the neighbor relationship to find equivalence classes of near-duplicate
sets. We use an efficient set union-find algorithm that
runs in amortized time of $O(E)$, where $E$ is the number of
matched pairs [17], which is again $O(N\sqrt{N})$.

Fig. 2. Flow diagram for visual meme detection method.
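The transitive closure step can be illustrated with a small union-find (disjoint-set) sketch. It assumes the ANN stage has already produced a list of matched frame-index pairs; the class and function names are hypothetical, and the path-halving and union-by-size heuristics are standard choices rather than details confirmed by the text.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving: point every other node on the path to its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def meme_clusters(num_frames, matched_pairs):
    """Group frames into near-duplicate clusters and leftover singletons."""
    uf = UnionFind(num_frames)
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = {}
    for frame in range(num_frames):
        groups.setdefault(uf.find(frame), []).append(frame)
    clusters = sorted(g for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


# Frames 0-1 and 1-2 match, so 0, 1, 2 end up in one cluster even though
# the pair (0, 2) was never returned by the approximate search.
pairs = [(0, 1), (1, 2), (4, 5)]
clusters, singletons = meme_clusters(7, pairs)
```

The example shows why the closure matters: an incomplete set of pairwise matches still yields complete equivalence classes.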

This process for detecting video memes is outlined in 
Figure 2. The input to this system is a set of video frames,
and the output splits this set into two parts. The first part 
consists of a number of meme clusters, where frames in the 
same cluster are considered near-duplicates with each other. 
The second part consists of the rest of the frames that are not 
considered near-duplicates with any other frame. Blocks A and 
D address the robust matching challenge using correlogram 
features and query-adaptive thresholding, and blocks B, C and 
E address the scalability challenge using approximate nearest-neighbor
(ANN) indexing. A few examples of identified near-duplicate
sets are shown in Figure 1. Visual meme detection
performance is evaluated in Section VIII-A.

Our design choices for the visual meme detection system
aim to find a favorable combination of accuracy and speed
that is feasible to implement on a single PC. Note that matching
local image points or video sequences [35] tends to be accurate
for each query, but is not easy to scale to $N^2$ matches. We found
that a single video shot is a suitable unit to capture community video
remixing, and that matching by video keyframe is amenable to
building fast indexing structures. The ANN indexing scheme
we adopt scales to several million video shots. For collections
of tens of millions to billions of video shots, we expect that
the computation infrastructure will need to change, such as
using a data center to implement a massively distributed tree
structure [26] and/or hybrid tree-hashing techniques.

V. Topic representation of memes

Visual memes are iconic representations of an event, and it is
desirable to augment their image-only representation with
textual explanations. The title and text descriptions
associated with many online videos can be used
for this purpose, despite the noisy and imprecise nature of the
textual content. We propose to build a latent topic model over
both the visual memes and available text descriptions, in order
to derive a concise representation of videos using memes, and
to facilitate applications such as annotation and retrieval.

Our model treats each video as a document, in which
visual memes are "visual words" and the annotations are text
words. By building a latent topic space for video document
collections, we embed the high-dimensional bag-of-words into
a more concise and semantically meaningful topic space.
Specifically, we learn a set of topics $z = 1, \ldots, K$ on the
multimedia document collection $\mathcal{D} = \{d_m, m = 1, \ldots, M\}$
using Latent Dirichlet Allocation (LDA) [7]. LDA models each
document as a mixture of topics with a document-dependent
Dirichlet prior, each topic drawn from the resulting multinomial,
and each word drawn from a topic-dependent multinomial
distribution. Our LDA model combines two types
of words, i.e., visual memes and text words, into a single
vocabulary $V = \{V_v, V_t\}$, and estimates the conditional
distribution of words given topics from a video collection.
Mathematically, each video $d_m$, represented as a bag of
words $\mathbf{w}_m = \{w_i\}$, is modeled as follows:

$$P(d_m \mid \alpha, \Phi) = \int \Big[ \prod_i \sum_{z_i} P(w_i \mid z_i, \Phi)\, P(z_i \mid \theta) \Big] P(\theta \mid \alpha)\, d\theta \tag{1}$$

where $z_i$ is the topic indicator variable for word $w_i$, $\theta$ is
the latent topic distribution, $\alpha$ is the hyperparameter
controlling the topic prior at the corpus level, and $\Phi$ collects
the topic-dependent word distributions.

Given the topic model, we can project a set of visual
words (or text tags) into the learned topic space by computing
the posterior of the topic distribution conditioned on those
words. Let the observed words be $\mathbf{w}$; we map $\mathbf{w}$ to the
mode $\hat{\theta}$ of the topic posterior:

$$\hat{\theta} = \arg\max_{\theta} \Big[ \prod_i \sum_{z_i} P(w_i \mid z_i, \Phi)\, P(z_i \mid \theta) \Big] P(\theta \mid \alpha) \tag{2}$$

where the parameters $\alpha$ and $\Phi$ are estimated from training data.
Inference for model learning and posterior calculation is
conducted with variational EM (see [7] for details).

A. Cross-modal matching (CM²) with topics

In social media, some words and names may be unseen
in previous events (e.g., entekhabat, "election" in Persian),
and iconic visual memes may appear without a clear context of
emergence. For a better understanding of these novel events, a
particularly useful step is to build associations between different
modalities, such as text and visual memes. We pose this as
a cross-modal matching problem, and aim to estimate how
well a textual or visual word (candidate result $w_r$) can explain
another set of words (query $\mathbf{w}_q = \{w_{q_n}\}$). This is achieved
by estimating the conditional probability of seeing $w_r$ given
that $\mathbf{w}_q$ is in the document, i.e., $p(w_r \mid \mathbf{w}_q, \mathcal{D})$. We call this
estimation process Cross-Modal Matching (CM²), and propose
its application for content annotation and retrieval.

A derivation sketch for CM² is as follows, under the context
of document collection $\mathcal{D}$ and the topic model $\{\alpha, \Phi\}$. We
consider modeling the conditional word distribution through
the topic representation. Without loss of generality, we assume
the query consists of visual memes and predict the probability
of each text tag. The first step of our method is to compute the
topic posteriors of the document collection $\mathcal{D} = \{d_m\}$ given
the query modality. Let $\mathbf{w}_m$ be the observed visual memes in
each document $d_m$; we estimate the topic posterior mode $\theta_m$
from Equation (2). Thus the whole document collection can be
represented as $\Theta = \{\theta_m\}$.

Given a new query $\mathbf{w}_q$, we also embed it into the topic
space by computing its topic posterior mode:

$$\theta_q = \arg\max_{\theta} \Big[ \prod_n \sum_{z_n} P(w_{q_n} \mid z_n, \Phi)\, P(z_n \mid \theta) \Big] P(\theta \mid \alpha)$$

Intuitively, we want to use "similar" videos in the topic space
to predict the text tag probability. Formally, the conditional
probability of a text tag $w_r$ is estimated by a voting scheme
as follows:

$$p(w_r \mid \mathbf{w}_q, \mathcal{D}) \propto \sum_m \sum_{w_i \in d_m} I\{w_i = w_r\}\, e^{-\|\theta_m - \theta_q\|^2 / 2\sigma^2} \tag{3}$$

where $\sigma$ controls the similarity of topic vectors.
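A minimal sketch of the voting scheme in Equation (3) is given below. It assumes the topic posterior modes of the documents and of the query have already been computed by some LDA implementation, and that similarity is a Gaussian kernel on the squared Euclidean distance between topic vectors. The function name `cm2_scores`, the bandwidth value, and the toy data are hypothetical.

```python
import math


def cm2_scores(doc_topics, doc_tags, query_topic, sigma=0.5):
    """Score candidate tags by similarity-weighted votes.

    doc_topics:  list of topic vectors, one per document
    doc_tags:    list of tag lists, aligned with doc_topics
    query_topic: topic vector of the query
    sigma:       kernel bandwidth (assumed value)
    Returns a dict tag -> normalized score.
    """
    scores = {}
    for theta_m, tags in zip(doc_topics, doc_tags):
        sq_dist = sum((a - b) ** 2 for a, b in zip(theta_m, query_topic))
        weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Every occurrence of a tag in a document casts one weighted vote.
        for tag in tags:
            scores[tag] = scores.get(tag, 0.0) + weight
    total = sum(scores.values())
    if total == 0:
        return scores
    return {tag: s / total for tag, s in scores.items()}


# Two documents close to the query in topic space, one far away.
doc_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
doc_tags = [["election", "protest"], ["election"], ["vaccine"]]
scores = cm2_scores(doc_topics, doc_tags, query_topic=[0.85, 0.15])
ranked = sorted(scores, key=scores.get, reverse=True)
```

Tags from documents near the query dominate the ranking, while tags from distant documents still receive a small, non-zero vote.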

A baseline method based on word co-occurrence can estimate
the conditional probability with co-occurrence counting:

$$p(w_r \mid \mathbf{w}_q, \mathcal{D}) \propto \sum_m \sum_{w_i \in d_m,\, q_n} I\{w_i = w_r \wedge w_{q_n} \in d_m\} \tag{4}$$

Examining the estimation Equations (3)-(4), we note that
CM² can be interpreted as a soft co-occurrence measure for
$(w_r, \mathbf{w}_q)$ over the entire document collection with the topic
model. In a sense, co-occurrence counting is a special case,
where the counts are weighted by the number of documents
in which $\mathbf{w}_q$ appeared.

B. CM² applications

CM² has several applications depending on the choice of $\mathbf{w}_q$
and $w_r$, such as: (1) Visual meme/video annotation. We use
visual memes as queries, $\mathbf{w}_q \subset V_v$, and return the top entries
of $w_r \in V_t$ sorted by $p(w_r \mid \mathbf{w}_q, \mathcal{D})$. The motivation of this
task for event monitoring is that the keywords are often specialized,
subjective, semantic, and non-visual, e.g., freedom. (2)
Keyword illustration. We can illustrate a keyword (e.g., H1N1)
with a set of most-related images. We take $\mathbf{w}_q \in V_t$, and return
the top entries of $w_r \in V_v$ sorted by $p(w_r \mid \mathbf{w}_q, \mathcal{D})$. In this
paper, we focus on application (1) and leave the others for
future exploration.

VI. Graphs on Visual Memes 

Visual memes can be seen as implicit links between videos 
and their creators that share the same unit of visual expression. 
We construct graph representations for visual memes and users 
who create them. This gives us a novel way to quantify
influence and the importance of content and users in this video-centric
information network.

Denote a video (or multimedia document) as $d_m$ in event
collection $\mathcal{D}$, with $m = 1, \ldots, N$. Each video is authored
(i.e., uploaded) by author $a(d_m)$ at time $t(d_m)$, with $a(d_m)$
taking its value from a set of authors $A = \{a_r : r = 1, \ldots, R\}$.
Each video document $d_m$ contains a collection of memes,
$\{v_1, v_2, \ldots, v_{K_m}\}$, from a meme dictionary $\mathcal{V}$. In this network
model, each meme induces a time-sensitive edge $e_{mj}$ with
creation time $t(e_{mj})$, where $m, j$ range over video documents or
authors.

A. Meme video graph 

We define the video graph $G = \{\mathcal{D}, E_G\}$, with nodes $d \in \mathcal{D}$.
There is a directed edge $e_{mj} \in E_G$ if documents $d_m$ and $d_j$
share at least one visual meme, and if $d_m$ precedes $d_j$ in
time, with $t(d_m) < t(d_j)$. The presence of $e_{mj}$ represents a
probability that $d_j$ was derived from $d_m$, even though there
is no conclusive evidence within the video collection alone
whether or not this is true. We denote the number of shared
visual memes as $\nu_{mj} = |d_m \cap d_j|$, and the time elapsed
between the posting times of the two videos as
$\Delta t_{jm} = t(d_j) - t(d_m)$.

We use two recipes for computing the edge weight $\omega_{mj}$.
Equation (5) uses a weight proportional to the number of
common memes $\nu_{mj}$, and Equation (6) scales this weight by
a power-law memory factor related to the time difference
$\Delta t_{jm}$. The first model is insensitive to $\Delta t_{jm}$, so it can
accommodate the resurgence of popular memes, as seen in
textual memes [25]. The power-law decay comes from known
behaviors on YouTube [15], and it also agrees with our
observations on the recency of the content returned by the
YouTube search API.

$$\omega^{*}_{mj} = \nu_{mj}, \quad (m, j) \in E_G \tag{5}$$

$$\omega'_{mj} = \nu_{mj}\, \Delta t_{jm}^{-\eta} \tag{6}$$

We estimate the exponent $\eta$ to be 0.7654, using a subset of
our data, over ten different topics retrieved over 24 hours of
time.
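The two edge-weight recipes in Equations (5) and (6) can be sketched as below. This is an illustrative quadratic-time loop, not a scalable implementation; it assumes each video is given as a timestamp plus a set of meme identifiers, and that time is measured in days (the unit is not stated in the text). The function name and data layout are hypothetical.

```python
def video_graph_edges(videos, eta=0.7654, use_decay=True):
    """Build directed, weighted edges between videos that share memes.

    videos: dict video_id -> (timestamp, set_of_meme_ids)
    Returns a dict (earlier_id, later_id) -> weight.
    With use_decay=False the weight is the shared-meme count (Eq. 5);
    with use_decay=True it is scaled by dt ** (-eta) (Eq. 6).
    """
    edges = {}
    ids = sorted(videos, key=lambda vid: videos[vid][0])
    for i, m in enumerate(ids):
        t_m, memes_m = videos[m]
        for j in ids[i + 1:]:
            t_j, memes_j = videos[j]
            shared = len(memes_m & memes_j)
            dt = t_j - t_m
            # Edges require a shared meme and a strictly earlier source.
            if shared == 0 or dt <= 0:
                continue
            edges[(m, j)] = shared * dt ** (-eta) if use_decay else float(shared)
    return edges


videos = {
    "a": (0.0, {1, 2}),
    "b": (1.0, {2, 3}),
    "c": (4.0, {1, 2, 5}),
}
counts = video_graph_edges(videos, use_decay=False)
decayed = video_graph_edges(videos, use_decay=True)
```

With the decay enabled, the edge from "a" to "c" is down-weighted because four time units separate the two uploads, even though they share two memes.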

B. Meme author graph 

We define an author graph $H = \{A, E_H\}$, with each author
$a \in A$ as a node. There is an undirected edge $e_{rs} \in E_H$ if
authors $a_r$ and $a_s$ share at least one visual meme in any video
that they upload in the event collection.

We compute the edge weight $\theta_{rs}$ on edge $e_{rs}$ as the
aggregation of the weights on all the edges in the video graph
$G$ connecting documents authored by $a_r$ and $a_s$:

$$\theta_{rs} = \sum_{\{m:\, a(d_m) = a_r\}} \sum_{\{j:\, a(d_j) = a_s\}} \omega_{mj} \tag{7}$$

with $r, s \in A$ and $m, j \in \mathcal{D}$. We adopt two simplifying assumptions
in this definition. The set of edges $E_H$ is bidirectional,
as authors often repost memes from each other at different
times. The edge weights are cumulative over time, because
in our datasets most authors post no more than a handful of
videos (Figure 7), and there is rarely enough data to estimate
instantaneous activities.
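The aggregation in Equation (7) can be sketched as follows, assuming the video-graph edges are available as a dictionary of weights. Treating the author edge as undirected by summing video edges in both directions, and skipping edges between two videos of the same author, are interpretations made for this example; the function name is hypothetical.

```python
from collections import defaultdict


def author_graph(video_edges, author_of):
    """Aggregate video-graph edge weights into undirected author edges.

    video_edges: dict (video_m, video_j) -> weight
    author_of:   dict video_id -> author_id
    Returns a dict frozenset({author_r, author_s}) -> cumulative weight.
    """
    weights = defaultdict(float)
    for (m, j), w in video_edges.items():
        r, s = author_of[m], author_of[j]
        if r == s:
            continue  # assumption: no self-loops in the author graph
        weights[frozenset((r, s))] += w
    return dict(weights)


video_edges = {("a", "b"): 1.0, ("a", "c"): 2.0, ("b", "c"): 1.0}
author_of = {"a": "alice", "b": "bob", "c": "alice"}
author_edges = author_graph(video_edges, author_of)
```

Here alice and bob are linked by two video edges (a to b, and b to c), so their cumulative edge weight is 2.0.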



C. Meme influence indices 

We define three indices based on meme graphs, which
capture the influence on information diffusion among memes,
and in turn quantify the impact of content and of authors
within the video sharing information network.

First, for each visual meme $v$, we extract from the event
collection $\mathcal{D}$ the subcollection containing all videos that have
at least one shot matching meme $v$, denoted as
$\mathcal{D}_v = \{d_j \in \mathcal{D}, \text{ s.t. } v \in d_j\}$. We use $\mathcal{D}_v$ to extract the corresponding video
document subgraph $G_v$ and its edges, setting all edge weights
$\nu$ in $G_v$ to 1 since only a single meme is involved. We compute
the in-degree and out-degree of every video $d_m$ in $\mathcal{D}_v$ as the
number of videos preceding and following $d_m$ in time:

$$\zeta^{in}_{m,v} = \sum_j I\{d_m, d_j \in \mathcal{D}_v,\ t(d_j) < t(d_m)\} \tag{8}$$
$$\zeta^{out}_{m,v} = \sum_j I\{d_m, d_j \in \mathcal{D}_v,\ t(d_j) > t(d_m)\}$$

where $I\{\cdot\}$ is the indicator function that takes a value of 1
when its argument is true, and 0 otherwise. Intuitively, $\zeta^{in}_{m,v}$
is the number of videos with meme $v$ that precede video
$d_m$ (potential sources), and $\zeta^{out}_{m,v}$ is the number of videos that
post meme $v$ after video $d_m$ (potential followers).

The video influence index $\chi_m$ is defined for each video
document $d_m$ as the smoothed ratio of its out-degree over
its in-degree, aggregated over all meme subgraphs $G_v$ (Equation (9);
the smoothing factor 1 in the denominator accounts
for $d_m$ itself). The author influence index $\hat{\chi}_r$ is obtained by
aggregating $\chi_m$ over all videos from author $a_r$ (Equation (10)).
The normalized author influence index $\bar{\chi}_r$ is its un-normalized
counterpart $\hat{\chi}_r$ divided by the number of videos an author
posted, which can be interpreted as the average influence of
all videos for this author.

$$\chi_m = \sum_v \frac{\zeta^{out}_{m,v}}{1 + \zeta^{in}_{m,v}} \tag{9}$$

$$\hat{\chi}_r = \sum_{\{m:\, a(d_m) = a_r\}} \chi_m, \qquad \bar{\chi}_r = \frac{\hat{\chi}_r}{\sum_m I\{a(d_m) = a_r\}} \tag{10}$$

The influence indices above capture two aspects of meme
diffusion: the volume of memes, and how "early" a video
or an author is in the diffusion chain. The first aspect is
similar to the retweet and mention measures recently reported
on Twitter [11]. The timing aspect of diffusion is new, and
it is designed to capture different roles that users play on
YouTube, such as information connectors and mavens [19].
Here connectors refer to people who come "... with a special 
gift for bringing the world together, . . . [an] ability to span 
many different worlds", and mavens are "people we rely upon 
to connect us with new information, . . . [those who start] word- 
of-mouth epidemics". 
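The influence indices defined in this subsection can be computed with a short sketch like the one below. It follows the textual definition (out-degree divided by one plus in-degree, summed over memes, then aggregated per author); the data layout, function name, and toy data are hypothetical, and the per-meme loop is quadratic for clarity rather than efficiency.

```python
from collections import defaultdict


def influence_indices(videos):
    """Compute video, author, and normalized author influence.

    videos: dict video_id -> (author, timestamp, set_of_meme_ids)
    Returns (chi_video, chi_author, chi_author_normalized).
    """
    posts_by_meme = defaultdict(list)
    for vid, (_, t, memes) in videos.items():
        for meme in memes:
            posts_by_meme[meme].append((t, vid))

    chi_video = defaultdict(float)
    for posts in posts_by_meme.values():
        for t_m, vid in posts:
            n_in = sum(1 for t_j, _ in posts if t_j < t_m)   # earlier posts
            n_out = sum(1 for t_j, _ in posts if t_j > t_m)  # later posts
            chi_video[vid] += n_out / (1.0 + n_in)

    chi_author = defaultdict(float)
    n_videos = defaultdict(int)
    for vid, (author, _, _) in videos.items():
        chi_author[author] += chi_video[vid]
        n_videos[author] += 1
    chi_norm = {a: chi_author[a] / n_videos[a] for a in chi_author}
    return dict(chi_video), dict(chi_author), chi_norm


videos = {
    "a": ("alice", 0, {"m1"}),
    "b": ("bob", 1, {"m1"}),
    "c": ("alice", 2, {"m1"}),
}
chi_video, chi_author, chi_norm = influence_indices(videos)
```

The earliest poster of a widely reposted meme receives the highest score, which matches the intuition that timing and volume both matter.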

VII. Predicting meme importance 

One long-standing problem in social media is predicting
the popularity of social memes [19]. Studies on social
meme adoption and popularity have focused on URLs [?],
hashtags [37], or view counts on YouTube [?], [34]. This
work investigates whether or not visual meme popularity is
predictable with knowledge of both the network and content.

Popularity, or importance, on social media is inherently
multi-dimensional, due to the rich interaction and information
diffusion modes. For YouTube, it can be the number of times
that a video is viewed [34], or the number of likes or favorites
that a video has received. While these commonly used metrics
focus on the entire video, not a given meme, we focus on two
targets that are inherent to visual memes: the number of times
that a video meme is reposted by other YouTube users (denoted
as volume), and the lifespan (in days) of a meme (life).

We build a meme prediction model using features related
to its early dynamics, the network around its authors, and the
visual and textual content. For each visual meme $v$ that first
appeared at time $t(v)$ (called onset time), we compute features
on the meme and author subgraphs up to $t_1 = t(v) + \Delta t$,
by including video nodes that appeared before $t_1$. $\Delta t$ is set to
one day in this work, to capture early meme dynamics, similar
to what has been used for view-count prediction [34].

These features are: (1) The volume of memes up to $t_1$. (2)
Static network features of author productivity and connectivity.
We use the total number of videos that the author has uploaded
to capture author productivity. An author's connectivity includes
three metrics computed over the author graph up to
time $t_1$: degree centrality [33] is the fraction of other nodes
that a node is directly connected to; closeness centrality is
the inverse of the average path length to all other nodes; and
betweenness centrality is the fraction of all-pairs shortest paths
that pass through a node [9]. (3) Dynamic features of author
diffusion influence. These include the meme influence indices
$\hat{\chi}_r$ and $\bar{\chi}_r$ in Equation (10), as well as the aggregate in-degree
and out-degree for each author.

We capture the video content using a bag-of-words vector and
multimodal topics that capture the co-occurrence of memes and
words. Specifically, the bag-of-words vector for each meme $v$
is the average count of each word over all videos containing $v$
within the first day; and the topic vector is the posterior probability
of each topic given $v$, inferred through the topic model
in Section V-A. For features aggregated over multiple authors,
we take the maximum, average, and standard deviation among
the group of authors who have posted or reposted the meme by
$t_1$. We use Support Vector Regression (SVR) [13] to predict
meme volume and lifespan on a log scale, using each feature
group and the combination of the features above.
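The per-meme aggregation of author features and the log-scale targets can be sketched as follows. The regression step itself is omitted. The population standard deviation and the base-10 logarithm are assumptions, since the text specifies neither, and the function and feature names are hypothetical.

```python
import math
import statistics


def aggregate_author_features(author_features):
    """Aggregate per-author features over the authors of one meme.

    author_features: non-empty list of dicts with identical keys,
    one dict per author who posted the meme before the cut-off time.
    Returns a flat dict with max, mean, and std for every feature.
    """
    aggregated = {}
    for name in author_features[0]:
        values = [features[name] for features in author_features]
        aggregated[name + "_max"] = max(values)
        aggregated[name + "_mean"] = statistics.fmean(values)
        # Population standard deviation (an assumed convention).
        aggregated[name + "_std"] = statistics.pstdev(values)
    return aggregated


def log_target(value):
    """Map a positive volume or lifespan to a log scale (base 10 assumed)."""
    return math.log10(value)


authors = [
    {"num_videos": 2, "degree": 0.5},
    {"num_videos": 4, "degree": 0.25},
]
agg = aggregate_author_features(authors)
target = log_target(100)
```

The resulting flat dictionary can be turned into one row of a feature matrix, with `target` as the regression label for that meme.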

VIII. Experiments and Observations 

Using the targeted-querying and collection procedures de- 
scribed in Section |III-B| we downloaded video entries from 
about a few dozen topics from May 2009 to March 2010. We 
used the following four sets for evaluation, which had enough 
volume and change over time to report results, summarized 
in Table |l| The SwineFlu set is about the H1N1 flu epidemic. 
The Iran3 set is about Iranian domestic politics and related 
international events during the 3 -month period of summer 
2009. The Irani set is a subset of Iran3 that was collected 
during the first of the 3 months; most of its videos are about 
the election in mid- June and its associated political outbreaks. 



Topic      #Videos   #Authors   #Shots      Upload time
SwineFlu   31,488    10,804     1,202,479   04/09-03/10
Iran3      23,049    4,681      1,255,062   08/07-08/09
Iran1      5,429     2,393      210,259     09/07-07/09
Housing    2,446     654        71,872      08/07-08/09

TABLE I

Summary of YouTube event data sets.
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Fig. 3. Performance of visual meme detection method on Housing Dataset. 

The Housing set is about the housing market in the 2008-09
economic crisis; a subset of these videos was manually
annotated and used to validate and tune the visual meme
detection algorithms.

We perform visual meme detection as described in Section IV.
We additionally filter the meme clusters identified by the
detection system, removing singletons that belong to a single
video or a single author. We process the words in the title
and description of each video by morphological normalization
with a dictionary. We preserve all out-of-vocabulary words,
which include foreign words and proper names (e.g., mousavi),
abbreviations (H1N1), and misspellings. We rank the words by
tf-idf across all topics, and take the top few thousand for topic
models, tagging, and popularity prediction. The prototype
system is implemented in C++, Python, and MATLAB, and it
can be deployed on one workstation with less than 8GB of
memory.
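
The cluster filtering step can be sketched as below. This is a minimal illustration under an assumed data layout (each cluster is a list of hypothetical `(video_id, author_id)` pairs, one per shot); the actual system's data structures are not described in the text.

```python
def filter_meme_clusters(clusters):
    """Drop clusters whose shots all come from one video or one author.

    clusters maps a cluster id to a list of (video_id, author_id) pairs,
    one pair per near-duplicate shot in that cluster.
    """
    kept = {}
    for cluster_id, shots in clusters.items():
        videos = {video for video, _ in shots}
        authors = {author for _, author in shots}
        if len(videos) > 1 and len(authors) > 1:
            kept[cluster_id] = shots
    return kept


# Hypothetical clusters: only "m1" spans two videos by two authors.
clusters = {
    "m1": [("v1", "alice"), ("v2", "bob")],
    "m2": [("v3", "carol"), ("v3", "carol")],   # single video
    "m3": [("v4", "dave"), ("v5", "dave")],     # single author
}
memes = filter_meme_clusters(clusters)
```

Requiring more than one author means that an author repeating their own footage across uploads is not counted as a meme.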

A. Meme detection performance 

We evaluate the visual meme detection method described
in Section IV using a test set created from the Housing
dataset. Specifically, we run multiple versions of k-means
clustering with a tight cluster radius threshold; we take the
union of detected near-duplicates, and we manually go through
a sample of clusters to explicitly mark correct and incorrect
near-duplicates. We also manually augment the detected
near-duplicate sets by performing visual content-based queries on
the color correlogram feature, and mark the top returns as
positive or negative. We specifically include many pairs that
are confused by the clustering and feature-similarity
retrieval steps. The resulting data set contains approximately
15,000 near-duplicate keyframe pairs and approximately 25,000
non-duplicate keyframe pairs.
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(a) Percentage of memes 



(b) Probability that a video contains a meme 
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Fig. 4. Video reposting probabilities, (a) Fraction of visual memes. (b) Video views vs. meme 
probability on Iran3 set. 
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Fig. 5. Rank vs frequency for words and visual memes. 



We compute the near-duplicate equivalence classes as described
in Section IV and calculate precision (P) and recall
(R) on the labeled pairs. The results are shown in Figure 3 for
varying values of the threshold parameter r. We note that the
performance is generally quite high, with P > 95%. There are
several possible operating points, such as P = 99.7%, R =
73.5% for a low false alarm rate; or P = 98.2%, R = 80.1%, which
produces the maximum F1 score of 0.88 (defined as $\frac{2PR}{P+R}$); or
P = 96.6%, R = 80.7% for the highest recall. For the rest of
our analysis, we use the last, high-recall, point with r = 11.5.
On the Iran3 set of over 1 million shots, feature extraction
takes around 7 hours on a quad-core CPU, and indexing and
querying with FLANN takes 5 to 6 CPU hours.
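
As a worked check of the operating points above, the following sketch computes pairwise precision, recall, and F1 from labeled keyframe pairs. The function names, the `cluster_of` mapping, and the toy pairs are hypothetical; only the three (P, R) operating points are taken from the text.

```python
def pair_precision_recall(cluster_of, positive_pairs, negative_pairs):
    """Precision and recall of predicted equivalence classes on labeled pairs.

    cluster_of maps a keyframe id to its predicted class id. A pair counts
    as a predicted duplicate when both keyframes share a class.
    """
    def same_class(a, b):
        return a in cluster_of and b in cluster_of and cluster_of[a] == cluster_of[b]

    tp = sum(same_class(a, b) for a, b in positive_pairs)
    fp = sum(same_class(a, b) for a, b in negative_pairs)
    fn = len(positive_pairs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, 2PR / (P + R)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy labeled pairs (hypothetical).
cluster_of = {"k1": 0, "k2": 0, "k3": 1, "k4": 1, "k5": 2}
p, r = pair_precision_recall(
    cluster_of,
    positive_pairs=[("k1", "k2"), ("k3", "k5")],
    negative_pairs=[("k1", "k3"), ("k3", "k4")],
)

# F1 at the three operating points quoted in the text.
f1_low_false_alarm = f1_score(0.997, 0.735)
f1_best = f1_score(0.982, 0.801)
f1_high_recall = f1_score(0.966, 0.807)
```

Evaluating the three quoted operating points this way confirms that P = 98.2%, R = 80.1% gives the largest F1, which rounds to 0.88.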

B. Content views and re-posting probability 

In our video collections, the behavior of remixing and
reposting is quite dominant. Over 58% of the videos contain
visual memes for Iran3 and 70% of the authors participate in
remixing; likewise, 32% and 45% respectively for SwineFlu,
as shown in Figure 4(a). These statistics suggest that, for
popular topics, there is much less original content than
remixes and reprises of existing sources.

We observe that the popularity of a video is a poor indicator of
how likely it is to be re-posted, which in turn determines
its influence. In the Iran3 set of more than 23K videos, for
example, the 4 most popular videos have no memes and
have nothing to do with the topic, and likewise for 7 of the
first 10. One has to get beyond the first 1,600 most popular
videos before the likelihood of having near-duplicates passes
the average for the dataset, at about 0.58 (see Figure 4(b)).
There are several reasons for this mismatch. Among the video
entries returned by the YouTube search engine, the most viewed
are often not related to the topic; for example, the one with
the highest view count is a music video irrelevant to Iranian
politics. Such videos also tend to be part of a production (e.g.,
promotion for a song), which has less value for re-posting
and re-interpretation. Moreover, view count is subject to a
"rich-get-richer" effect due to content recommendations and
promotions on the YouTube site. In short, popularity is a poor
proxy for relevance or importance.

This mismatch is a further example of the very unequal
distribution of views that characterizes this domain.
To quantify the inequality of view counts, we computed
the Gini coefficient of this data set and found it to have
the extreme value of 0.94, whether one looks at the videos
with near-duplicates or those without. The Gini coefficient,
used in economics, ranges from 0 (each video with an equal
number of views) to 1 (one video with all of the views). The
value we observed far exceeds the measure of inequality for
the distribution of wealth in any known country, which has its
maximum at about 0.7.
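
For reference, the Gini coefficient of a list of view counts can be computed as in the minimal sketch below. It uses the standard sorted-rank formula; the function name and the toy view counts are hypothetical and are not data from this study.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal).

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n with the values
    sorted in increasing order and ranks i = 1..n. For finite n the maximum
    is (n - 1) / n, which approaches 1 as n grows.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


equal_views = gini([5, 5, 5, 5])      # every video has the same view count
one_winner = gini([0, 0, 0, 10])      # a single video has all the views
```

With only four videos the most unequal case yields 0.75 rather than 1; the upper bound of 1 is reached only in the limit of many videos, which is the regime of collections with tens of thousands of entries.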

We observe considerable similarity in the frequency
distribution of visual memes and words. Figure 5 plots textual word
and visual meme frequencies versus rank on a log-log scale.
Performing a regression fit, we obtain the following Zipf
power-law distributions, where r is the rank, $w_t$ a text word,
and $w_v$ a visual meme:

$$f(w_t) \propto r^{-s_t}, \qquad f(w_v) \propto r^{-1.959}$$

The exponent $s_t$ for words in the title and description is close
to that of English words (≈ 1.0). For visual memes s = 1.959,
suggesting that the diversity of visual memes is less than that
of words at the lower-frequency end. Still, this observation
validates that the visual memes form a vocabulary with a
"proper" distribution, and that it makes sense to model visual
memes as a "language" on YouTube, similar to textual words.
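
The regression fit behind a Zipf exponent can be sketched as an ordinary least-squares line on the log-log rank-frequency plot. This is a minimal illustration; the paper does not state which fitting procedure was used, and the function name and synthetic frequencies are hypothetical.

```python
import math


def fit_zipf_exponent(frequencies):
    """Estimate s in f(r) ~ r^(-s) by least squares on log(rank) vs log(freq)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx   # negated slope of the fitted line


# Synthetic frequencies that follow an exact power law with s = 2.
synthetic = [1000.0 / rank ** 2 for rank in range(1, 51)]
estimated_s = fit_zipf_exponent(synthetic)
```

A larger exponent means frequency falls off faster with rank, which is why s = 1.959 for memes indicates a thinner tail of rare items than the word distribution.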

C. Multimodal topics and CM2

We learn topic models on a joint vocabulary of words
and memes. For words, we adopt a tf-logidf re-weighting
scheme [27] across more than two dozen topics monitored
around the same time, in order to suppress very popular words
without overly favoring rare words. The visual meme
vocabulary is constructed using a threshold on meme frequency. In
the following experiments, we choose 2000 text words for
both datasets, 12000 visual memes for the Iran3 dataset, and
4000 visual memes for the SwineFlu dataset.

We set the number of topics to 50 by validation, and use
the term-topic probabilities p(w|z) to label a topic, using both
text words and visual memes. Figure 6(a) shows two example
topics over the collections Iran3 and SwineFlu, respectively.
Topic #5 contains the keywords "neda, soltan, protest, ...", together
with the images capturing her tragic murder and her portrait that was
published later. The topic volume over time clearly shows
the onset and peak of the event (just after June 20th, 2009),
and it is also clear that this event influenced subsequent
protest activities in July.

Fig. 6. (a) Topic model example, with topic volume per day: Iran topic #05,
"neda de la soltan protest", and Swine Flu topic #15, "1976, minute,
propaganda, shot". (b) Retrieval results using CM2 for query meme 's000013':
co-occurrence returns "vaccine, 1976, congressman ..."; CM2 returns
"propaganda, 1976, ...".

log-prob.   Co-occurrence   Topic Model
Iran3       -6.08 ± 0.06    -5.65 ± 0.05
SwineFlu    -6.59 ± 0.03    -6.54 ± 0.04

TABLE II

Comparison on the log-probability of tags.

We examine the CM2 model for video tagging in context.
Here we use the visual memes of each video as the
query and retrieve the tagging words using scores computed
with Equation 3. We also implement the baseline in Equation 4
and compare the retrieved tags and memes with those retrieved by
top co-occurrence. We carried out five-fold cross-validation,
and report the average performance based on the average log
likelihood [4] of the existing tags. We did not use a precision
or ranking metric, as the tags associated with each video are sparse
and positive-only, and many informative tags are missing in
the data. Table II shows that the average log likelihood is
improved on both datasets, which demonstrates the advantage of
the topic-based representation.

Figure 6(b) shows an example result of using one of the memes
in the 1976 video as the query. We can see that the co-occurrence
model returns 1976, vaccine, congressman, ..., which are all
spoken or used as description in the original 1976 government
propaganda video, while CM2 returns 1976, propaganda,
which apparently come from the topic above. Comparing the
images, we can also see that the top memes returned by the
co-occurrence model are all from the same video, since the
parody is mostly posted by itself, with little or no remix, while
CM2 also retrieves two memes relating to modern-day vaccine
discussions in the news media, providing relevant context.

The rightmost column in Figure 6 shows the temporal
evolution of a meme (sub-figure b) and two topics (sub-figure
a). We can see the source of the 2008 propaganda video in the
meme evolution; it also reveals that there are multiple waves
of remix and re-posting around the same theme. The topic
evolutions, on the other hand, segment out sub-events from
the broader unfolding of many themes, with Iran topic #5
representing the murder of Neda and its subsequent influence,
and Swine Flu #15 closely correlated to the public discussion
on vaccines.

Fig. 7. Meme influence indices vs. author productivity on topic Iran3 (left)
and SwineFlu (right); detailed discussions in Section VIII-D.



D. Influence index of meme authors 

We compute the diffusion index for authors according to
Equation 10. Figure 7 contains scatter plots of the author
influence indices on the y-axis, versus "author productivity"
(number of videos produced) on the x-axis. For both the
Iran3 topic and the SwineFlu topic, we plot the total diffusion
influence and the normalized diffusion influence.

In the Iran3 topic we can see two distinct types of contributors.
We call the first contributor type mavens [19] (marked
in red), who post only a few videos, which nevertheless tend to be
massively remixed and reposted. This particular maven was
among the first to post the murder of Neda Soltan, and one
other instance of a student murdered on the street. The former
post became the icon of the entire event timeline. We call the
second contributor type information connectors [19] (circled in
green), who tend to produce a large number of videos and
have a high total influence factor, yet lower influence per
video. They aggregate notable content and serve the role of
bringing this content to a broader audience. (A response metric
such as view count or number of comments could further
confirm this.) We examined the YouTube channel pages
of a few authors in this group; they seem to be voluntary
political activists with screen names like "iranlover100", and
we can dub them "citizen buzz leaders". Some of their videos
are slide shows of iconic images. Note that traditional news
media, such as AlJazeeraEnglish, AssociatedPress, and so on
(circled in gray), have a rather low influence metric for this
topic, partially because the Iranian government severely limited
international media participation in the event; most of the event
buzz was driven by citizens.

However, the SwineFlu collection behaves differently in
its influence index scatter plots. We can see a number of
connectors on the upper right-hand side of the total diffusion
scatter plot, but it turns out that they are the traditional media (a
few marked in gray), most of which have a large number (>40)
of videos with memes. The few mavens in this topic (marked
with green text) are less active than in the Iran topic, and
notably they all reposted the identical old video containing
government health propaganda for the previous outbreak of
swine flu in 1976. These observations suggest that the
traditional news media drove most of the content
on this topic, and that serendipitous discovery of novel content
also exists, but with less diversity.

These visualizations can serve as a tool to characterize
influential users in different events. We can find user groups
serving as mavens, or early "information specialists" [19], and
connectors, who "bring the rest . . . together", and hence
observe different information dissemination patterns in
different events.

E. Meme popularity prediction results 

We predict the lifespan of memes as described in Section VII.
We prune memes that appear fewer than 4 times;
the remaining memes are randomly split in half for
training and testing. The Iran3 dataset has 4081 memes in each of the
train/test splits, and the SwineFlu set has 398. Different features
are filtered by a low correlation threshold (0.03) and then
concatenated to form a joint feature space. We train a support
vector regressor [13] by searching over the hyperparameter C and
different kernel types: linear, polynomial, and radial basis
function. We use three different metrics to quantify regression
performance: mean squared error (mse), Pearson correlation
(corr), and Kendall's tau (tau) [22]. Each regressor is trained
with five different random splits of train/test data; the average
performance with its standard deviation (as error whiskers) is
shown in Figure 8.
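
The three evaluation metrics can be written out directly, as in the sketch below. It is a plain-Python illustration with hypothetical function names; the Kendall's tau shown is the simple tau-a variant without tie correction, which may differ from the variant used in the experiments.

```python
import math
from itertools import combinations


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical log-scale targets and predictions with one swapped pair.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 3.0, 2.0, 4.0]
scores = (mse(y_true, y_pred), pearson(y_true, y_pred), kendall_tau(y_true, y_pred))
```

MSE measures the size of the errors, Pearson correlation measures linear agreement, and Kendall's tau depends only on the ranking, so the three can disagree when a regressor orders memes correctly but is miscalibrated.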

We can see that the meme graph features (connectivity and
influence) both outperform the volume feature volume-d1.
Combining these three types of features (net-all) further
improves prediction performance, and the text keyword feature (txt)
is the single strongest predictor. Adding the presence and absence
of other visual memes to text (txt+vmeme) is not stronger than text
alone, while the combination of network, word, and meme features has
the strongest performance (net+txt+vmeme). The Iran3 dataset,
with significantly more data to learn from, has better and more
stable results than SwineFlu. From the average MSE, the
predictions for meme volume on Iran3 are within 1.7% ($\sqrt{10^{0.015}}$)
and those for meme lifespan within 16.1% ($\sqrt{10^{0.13}}$). In Figure 9 we
examine the top- and bottom-ranked visual memes with the net+txt
feature, showing that the top memes are intuitively on-topic,
while most of the low-ranked memes have no clear connection
to the topic. Figure 9 also shows the words that are positively and
negatively correlated with each of the prediction targets. We can
see that they are notably different from the frequently occurring
words in either collection. Indicative words include those that
indicate trusted authors (bbc), particular sub-events (riot), or
video genres that engender participation (remix). On the other
hand, certain frequent words such as watch, video, or swine
flu, h1n1 are shown to be non-informative.
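
The percentages quoted for Iran3 can be reproduced by a short calculation. The sketch below assumes that the average MSE values are 0.015 (volume) and 0.13 (lifespan) in base-10 log units, and that the relative error is obtained as the square root of 10 raised to the MSE, minus one; both the reading of the values and the conversion are inferred from the quoted numbers rather than stated explicitly in the text.

```python
import math


def relative_error_from_log_mse(log10_mse):
    """Relative error implied by an MSE measured on log10-scale targets.

    Assumed conversion: sqrt(10 ** MSE) - 1.
    """
    return math.sqrt(10 ** log10_mse) - 1


volume_error = relative_error_from_log_mse(0.015)     # meme volume on Iran3
lifespan_error = relative_error_from_log_mse(0.13)    # meme lifespan on Iran3
volume_pct = round(100 * volume_error, 1)
lifespan_pct = round(100 * lifespan_error, 1)
```

Under these assumptions the two values come out at 1.7% and 16.1%, matching the figures in the paragraph above.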

IX. Conclusions 

In this paper, we proposed visual memes for tracking and
monitoring real-world events on YouTube. We described a
large-scale event-based social video monitoring and analysis
system. We proposed a scalable algorithm for extracting visual
memes with high accuracy. We applied visual memes to
estimating influence and predicting content popularity using
network models. Using our system, we have quantified the
percentage of remixed content. We have observed the relationship
between remix popularity and content views, and the timing of
the remix. We have shown that memes can help quantify the
roles that different user groups play in propagating information.
Furthermore, we have designed a cross-modal matching (CM2)
method for annotating the meanings of visual memes, and
effectively predicted visual meme popularity with network and
content features. Future work includes exploring other
applications of cross-modal association in social media, examining
visual meme sequences, and generalizing the observations and
application of visual memes to video genres other than news.
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Fig. 8. Meme popularity prediction performance using various network and content features (best viewed in color). Top: Iran3 dataset; bottom: Swineflu 
dataset. Prediction targets: meme volume (# of videos, left) and lifespan (days, right); performance metrics: MSE (smaller is better), Pearson correlation and
Kendall's tau (larger is better). See Section VIII-E for discussions on various features.
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